home *** CD-ROM | disk | FTP | other *** search
- Path: po.CWRU.Edu!mab22
- From: mab22@po.CWRU.Edu (Michael A. Balfour)
- Newsgroups: comp.lang.c
- Subject: Re: unique id for a string
- Date: 22 Feb 1996 19:37:19 GMT
- Organization: Case Western Reserve University, Cleveland, OH (USA)
- Message-ID: <4giglf$99h@madeline.INS.CWRU.Edu>
- References: <1996Feb16.175601.114182@kuhub.cc.ukans.edu> <harmon.824680556@pegasus.montclair.edu>
- Reply-To: mab22@po.CWRU.Edu (Michael A. Balfour)
- NNTP-Posting-Host: kanga.ins.cwru.edu
-
-
- In a previous article, harmon@pegasus.montclair.edu (Derek Harmon) says:
-
- >anh@kuhub.cc.ukans.edu writes:
- >>This is not a quite a C problem. But since the implementation will be in
- >>C anyway.
- >
- >>Another way is to simply assign a id to a word and store the word in a
- >>binary search tree, and therefore, it is relatively fast to make sure
- >>each word has an unique id. But I think it is still to slow. We are
- >>talking about a possible 200,000+ words.
- >
- > How balanced the tree would be depends on either the order of input, or
- >the tree's self-balancing of itself which makes input longer, but logarith-
- >mically lowers the bound on search time. But we are talking about more than
- >200,000 words here, and that is where the problem occurs. If the max length
- >of these strings is 32 characters, then we are talking about roughly 7.3 x
- >10^43 different strings. That will not fit into a 64-bit long. :)
- >
-
- Another possibility to approach is a hash table. A method I've been
- examining for doing searches on strings instantly without even storing
- the list of strings involves minimal perfect hash functions. The
- general idea is that every string you're using as input hashes to a
- specific unique value. Using this technique, you just hash the string
- to be searched on, and use the hash value to index into an array of
- return values.
-
- A good paper on the topic can be found at ftp.cs.uq.oz.au. The paper is
- "An optimal algorithm for generating minimal perfect hash functions"
- written by Zbigniew Czech, George Havas, and Bohdan Majewski.
-
- Mike Balfour
- > The formula you give above will give all those strings distinct id nums,
- >if you have knowledge of the probability of occurence of symbols from that
- >language, you can shift the probability of smaller id num size (there will
- >still be large id nums) for a random sampling of those strings. However,
- >if symbols have equal probabilities of selection, finnagling of the formula
- >won't add any advantage. If it were possible, the world would be a path to
- >your door for an infinite data compression method.
- >
- > -- Stone
- >--
- ># Derek Harmon (aka Stonelight) harmon@pegasus.montclair.edu
- ># - Computer Science Undergrad, Montclair State University, NJ
- ># - My views are my own, nobody else is this creative. 3;)>
- >... Recursive. adj., see Recursive.
- >
- >
-
- --
- ----------------------------------+--------------------------------
- Mike Balfour, Partner | BS/MS Graduate - ECMP
- Overload Engineering | Case Western Reserve University
- "New Ideas for a Brighter Future" | Cleveland, OH
-